Assignment 1 Full Writeup
Introduction and Literature Review
In the socio-political landscape, the manner and content of how leaders communicate provides critical insight into their governance style, priorities, and ideology. The State of the Nation Address (SONA) serves as an essential touchstone in South Africa’s political calendar, where the sitting president not only provides an annual report on the nation’s status but also sets the tone for policy directions and government intent for the subsequent year. Delivered at a joint sitting of Parliament, the address receives heightened scrutiny in election years, when it is given twice: once before and once after the election (African 2023).
Natural Language Processing (NLP) has been increasingly leveraged in the domain of political science to uncover patterns, biases, and ideologies in the speeches and writings of political leaders. Recent advancements in machine learning and NLP tools have enabled more refined text analysis, going beyond mere word frequency to semantic content and stylistic nuances. Researchers such as Katre (2019) and Glavas, Nanni and Ponzetto (2019) have demonstrated the efficacy of using NLP to categorise and analyse political speeches. This raises the question: Can we discern, based purely on textual analysis, which South African president might have uttered a particular sentence during their SONA speech? In other words - can we predict the author of a sentence based on the content and style of the sentence?
However, while there is an abundance of literature on NLP applications in sentiment analysis and topic modelling, its application to discern between specific authors or speakers, especially in the South African political sphere, remains largely unexplored (although there has been some development with regards to creating text resources for South African languages (Eiselen and Puttkammer 2014)). This gap is particularly noticeable when considering the unique linguistic, cultural, and political landscape of South Africa. The challenges lie not just in the variety of linguistic styles but also in the depth and breadth of topics covered, as well as the personal idiosyncrasies of each president (within a single speech as well as over time).
Given the above context, this paper aims to predict which of the South African presidents between 1994 and 2022 might have said a specific sentence during their SONA address. It leverages various text transformation techniques, such as Bag of Words (BoW), Term Frequency-Inverse Document Frequency (tf-idf), and text embeddings (a very simple embedding as well as BERT). Subsequent application of machine learning models, including a feed-forward neural net, Support Vector Machine (SVM), Naive Bayes, and a BERT classification model, offers a comparative lens to evaluate the efficacy of each approach.
Data Preparation
The dataset (Republic of South Africa 2023) comprised a series of text files of the State of the Nation Addresses (SONA) from 1994 through 2022. Each speech’s content was subsequently ingested, omitting the initial lines. These speeches were then collated into a structured format for more convenient access and manipulation.
Subsequently, essential metadata, including the year of the address and the name of the delivering president, was extracted. After that, URLs, HTML character codes, and newline characters were removed. Additionally, the date of each address was extracted and appropriately formatted.
To achieve the project’s objectives, each speech was dissected into its individual sentences. This granular breakdown facilitated the mapping of each sentence to its originating president. The finalised structured dataset comprises individual sentences paired with their respective presidents. This dataset was also saved as a csv file for future use.
For the model building, the data was prepared by creating a 70-15-15 train-validation-test split, with the same seed being used for each method to ensure fair comparisons.
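The split described above can be sketched in plain Python. This is a minimal illustration only; the actual pipeline may well have used a library utility such as sklearn's `train_test_split`, and the integer percentages here avoid floating-point slicing surprises.

```python
import random

def train_val_test_split(items, seed=42):
    """Shuffle deterministically, then slice into 70-15-15 partitions."""
    rng = random.Random(seed)   # fixed seed -> identical split for every method
    shuffled = list(items)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = (70 * n) // 100   # integer arithmetic keeps the sizes exact
    n_val = (15 * n) // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

sentences = [f"sentence {i}" for i in range(1000)]
train_set, val_set, test_set = train_val_test_split(sentences)
```

Because the seed is fixed, calling the function again yields the same partitions, which is what makes cross-method comparisons fair.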
Number of speeches per president
The bar plot above illustrates the total number of speeches given by each president. Mbeki and Zuma had the most speeches in the dataset, with 10 each. This means there is a substantial amount of data available for them, which could be advantageous when discerning their linguistic patterns, given that there is not significant overlap in the sentences of the two presidents. Motlanthe and de Klerk only had one speech each, which introduces an imbalance in the data that may bias the model output later. To explore this further, the number of sentences per president is examined.
Number of sentences per president
The plot above gives a breakdown of the number of sentences spoken by each president. Zuma stands out with the most sentences, further underscoring his prominence in the dataset. Notably, while Mbeki gave three more speeches than Ramaphosa, their sentence count is nearly the same, implying that Ramaphosa’s speeches might be more verbose or detailed. This data provides a deeper understanding of the granularity of each president’s contribution and reaffirms the potential data imbalance to be addressed in model development, especially when considering the fact that de Klerk and Motlanthe have less than 300 sentences each, while the others have well over 1500.
Average sentence length per president
This plot unveils the average sentence length, in words, for each president. A striking observation is that Zuma, despite having the most sentences and speeches, has a relatively concise average sentence length. Conversely, Mbeki and Motlanthe have longer average sentence lengths, with Mbeki being the only president who averaged over 30 words per sentence. This metric offers insights into the verbosity and style of each president, which can be a useful feature when discerning speech patterns in model building.
Word clouds for each president
The word clouds above offer a visually compelling representation of the most frequently used words by each president. The size of each word in the cloud corresponds to its frequency in the speeches. All the presidents had “will” as their most prominent word and referred to the country many times while speaking (highlighted by the use of the words “south” and “africa”/“african”). Motlanthe seemed to focus more on the economy and public image through words such as “national”, “public” and “government”, whereas Mandela seemed to focus more on the people through words such as “people” and “us”. de Klerk focused more on the constitution and forming alliances during a transitional period, and Zuma focused more on work and development. These word clouds provide a snapshot of the focal points and themes of each president’s speeches. Distinctive words or terms can be potential features when building predictive models. The words from the word clouds can also be seen in the bar plots below.
Word frequency distribution for each president
N-gram frequency distributions for each president
Bigrams
Instead of only looking at single word frequency, bigrams can also be used to find the most common two-word phrases. The bigrams above elucidate the distinctive linguistic patterns and thematic foci of each president, presenting opportunities for differentiation. For instance, President Mandela’s frequently used bigrams, such as “South Africans” and “national unity,” reflect his emphasis on nation-building and reconciliation during his tenure. In contrast, President Zuma’s bigrams like “economic growth” suggest a policy-driven discourse concentrated on economic dynamics. However, there are potential pitfalls. Overlapping or common bigrams across presidents, such as generic terms or phrases prevalent in political discourse, could introduce ambiguity, potentially hindering the model’s precision. Additionally, while President Ramaphosa’s bigrams like “South Africa” are distinctly frequent, they are not uniquely attributable to him, as such phrases are likely universal across South African presidencies.
Trigrams
Expanding on the analysis of linguistic markers, trigrams offer insights into the most recurrent three-word sequences employed by each president. The trigram outputs above further refine our understanding of the unique verbal choices and thematic concerns of each leader. For instance, President Mandela’s recurrent trigrams, such as “trade union movement”, underscore his consistent focus on the working class of South Africa. Meanwhile, President Zuma’s trigrams, such as “expanded public works”, indicate a focus on the public sector as a whole. Conversely, the presence of generic or universally applicable trigrams, such as “state nation address”, might pose challenges. These broadly used trigrams, inherent to political addresses across presidencies, might dilute the distinctive features of individual presidents, complicating the model’s task. Moreover, trigrams like “south africa will” from President Ramaphosa, although salient, are emblematic of speeches common to all presidents, making them less distinguishing. Thus, while trigrams can accentuate the nuances of each president’s discourse, the model would benefit from discerning the balance between distinctiveness and generic trigram usage.
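Counting n-grams of this kind can be sketched with the standard library alone. This is a minimal illustration: the real pipeline would also lowercase, strip punctuation, and remove stop words before counting.

```python
from collections import Counter

def ngrams(tokens, n):
    """Slide a window of size n over the token list, joining each window."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the state of the nation address sets the tone"
tokens = sentence.split()
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))
```

Aggregating such counters over all of a president's sentences (e.g. `Counter.update`) yields the per-president frequency distributions plotted above.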
Sentence similarity between presidents
Bag of Words (BOW) representation
The Bag-of-Words (BoW) visualisation above reveals a pronounced central cluster with substantial overlap across presidential sentences, indicating pervasive shared linguistic elements. This convergence towards common terms suggests that the BoW representation predominantly captures universal themes and terminologies characteristic of political discourse. Such patterns, while illuminating shared linguistic tendencies, underscore potential challenges in predictive modeling, with the BoW approach possibly lacking the granularity to detect distinctive linguistic markers for each president.
TF-IDF representation
Using the TF-IDF representation, the visualization depicts a dominant central cluster, reaffirming the presence of overlapping linguistic constructs across presidential discourses. Unlike the BoW representation, the TF-IDF visualization lacks discernible smaller clusters, and data points appear more dispersed. This dispersion underscores the varied thematic undertones each president might have explored, but the pronounced overlap in the central region suggests that these thematic variations are not sufficiently distinct in the TF-IDF space to provide clear demarcations. The observed patterns emphasize the challenges inherent in solely relying on TF-IDF for capturing the unique linguistic nuances of each president.
Tokenization with Padding representation
Utilising tokenization with padding, the resultant visualization presents multiple clusters, indicating the method’s ability to recognize shared linguistic constructs or thematic groupings within the dataset. Notably, the significant intermingling of presidents within these clusters underscores the shared nature of discourse patterns across different presidencies. The absence of a dominant central cluster, a divergence from the BoW and TF-IDF representations, alludes to a more nuanced and diverse sentence representation in the embedding space, potentially attributed to the emphasis on sentence structure inherent in the tokenization method.
Methods
1. Text Representation Techniques
a. Bag-of-Words (BoW)
The Bag-of-Words (BoW) representation is a simplistic yet effective method for text data representation. It hinges on representing text by its constituent words, disregarding their order. Here, each word operates as a feature, with the text being represented by a vector that denotes the frequency of each word (V M and Kumar R 2019).
Formally, given a vocabulary \(V\) comprising \(N\) unique words, each document \(d\) can be depicted as a vector \(\mathbf{v}_d\) in \(\mathbb{R}^N\) , where the i-th element \(v_{d,i}\) denotes the frequency of the i-th word in the document:
\[ \mathbf{v}_d = [v_{d,1}, v_{d,2}, \ldots, v_{d,N}] \]
The dataset was transformed into a BoW representation with each row corresponding to a sentence, and each column reflecting the frequency of a word in that sentence. The CountVectorizer class from the sklearn.feature_extraction.text module was employed for this task, with English stop words being excluded to filter out prevalent words that lack significant meaning, such as “and”, “the”, and “is” (Pedregosa et al. 2011).
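In miniature, the transformation can be sketched without sklearn as follows. The tiny stop-word list here is a stand-in for the full English list that `CountVectorizer` applies, so this is an illustration of the idea rather than the exact feature matrix used in the study.

```python
from collections import Counter

STOP_WORDS = {"and", "the", "is", "of", "to"}  # illustrative subset only

def bow_vectors(sentences):
    """Build a sorted vocabulary, then map each sentence to a count vector."""
    tokenised = [[w for w in s.lower().split() if w not in STOP_WORDS]
                 for s in sentences]
    vocab = sorted({w for toks in tokenised for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for toks in tokenised:
        v = [0] * len(vocab)
        for word, count in Counter(toks).items():
            v[index[word]] = count
        vectors.append(v)
    return vocab, vectors

vocab, vectors = bow_vectors(["we will grow the economy",
                              "the economy will grow"])
```

Note how word order is discarded: the two sentences differ only in the presence of "we", and their vectors differ in exactly one position.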
b. Term Frequency-Inverse Document Frequency (TF-IDF)
Contrastingly, the TF-IDF representation scales the frequency of words based on their occurrence across all documents, ensuring that words appearing too frequently across documents (potentially bearing lesser discriminative importance) are assigned lower weights (“Understanding TF-IDF: A Simple Introduction”, n.d.).
The term frequency (TF) of a word in a document is the raw count of that word in the document. The inverse document frequency (IDF) of a word is defined as:
\[ \text{IDF}(w) = \log \left( \frac{N}{1 + \text{count}(w)} \right) \]
where \(N\) signifies the total number of documents and \(\text{count}(w)\) represents the number of documents containing the word \(w\). The TF-IDF value for a word in a document is then the product of its TF and IDF values (“Understanding TF-IDF: A Simple Introduction”, n.d.).
The TfidfVectorizer class from the sklearn.feature_extraction.text module was employed to transform our dataset into this representation (Pedregosa et al. 2011).
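As a small worked illustration of the formula above (note that sklearn's `TfidfVectorizer` uses a slightly different smoothed IDF variant plus normalisation, so its exact values will differ):

```python
import math

def tfidf(documents):
    """TF = raw count in the document; IDF = log(N / (1 + doc count))."""
    n_docs = len(documents)
    tokenised = [doc.lower().split() for doc in documents]
    vocab = sorted({w for toks in tokenised for w in toks})
    df = {w: sum(w in toks for toks in tokenised) for w in vocab}
    idf = {w: math.log(n_docs / (1 + df[w])) for w in vocab}
    # score each word that actually occurs in each document
    return [{w: toks.count(w) * idf[w] for w in set(toks)}
            for toks in tokenised]

scores = tfidf(["economy growth economy", "growth plan", "education plan"])
```

With three documents, "growth" appears in two of them, so its IDF is log(3/3) = 0 and its TF-IDF vanishes, while the rarer "economy" and "education" retain positive weight: exactly the down-weighting of common words described above.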
c. Text Embedding
For processing by deep learning models like neural networks, textual data was tokenized and converted into sequences of numbers. The Tokenizer class from the keras.preprocessing.text module was utilized for this purpose. Subsequently, sentences were padded with zeros using pad_sequences from the keras.preprocessing.sequence module to ensure uniform length (Chollet et al. 2015).
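The effect of the `Tokenizer` and `pad_sequences` steps can be mimicked in plain Python. This is a simplified sketch: Keras additionally lowercases, strips punctuation, and assigns ids by descending word frequency rather than order of first appearance.

```python
def tokenize_and_pad(sentences, max_len=None):
    """Map each word to an integer id (0 is reserved for padding),
    then right-pad every sequence to a common length."""
    word_index = {}
    sequences = []
    for s in sentences:
        seq = []
        for w in s.lower().split():
            if w not in word_index:
                word_index[w] = len(word_index) + 1  # ids start at 1
            seq.append(word_index[w])
        sequences.append(seq)
    max_len = max_len or max(len(seq) for seq in sequences)
    return [seq + [0] * (max_len - len(seq)) for seq in sequences]

padded = tokenize_and_pad(["we will grow", "growth for all south africans"])
```

Every sentence ends up as an equal-length integer sequence, which is the uniform input shape a neural network expects.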
2. Model Architectures and Training
a. Feed-Forward Neural Network
Feed-forward neural networks (FFNNs) are a subset of artificial neural networks characterized by acyclic connections between nodes. They encompass multiple layers: an input layer, several hidden layers, and an output layer (Rumelhart, Hinton, and Williams 1986).
The architecture of the neural network employed in this study is delineated as follows:
- Input Layer: This layer harbors neurons equal to the number of features in the dataset (word counts for BoW and TF-IDF, sequence length for text embeddings). The Rectified Linear Unit (ReLU) activation function was utilized owing to its efficiency and capability to mitigate the vanishing gradient issue:
\[ f(x) = \max(0, x) \]
- Hidden Layers: Several hidden layers were introduced, each utilizing He initialization, which is proficient for layers with ReLU activation. A dropout layer succeeded each hidden layer to curb overfitting by randomly nullifying a fraction of input units during each training update.
- Output Layer: This layer contains neurons equal to the number of classes (presidents, in our scenario). The softmax function was employed as the activation function, generating a probability distribution over the classes:
\[ \sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \]
for \(i = 1, \ldots, K\) and \(\mathbf{z}\) is the input vector to the softmax function.
Training was conducted using the Adam optimization algorithm with a learning rate of 0.001. Adam is adept at training deep neural networks via computing adaptive learning rates for each parameter, leveraging moving averages of the parameter gradients and squared gradients.
The EarlyStopping and ReduceLROnPlateau callbacks were also enlisted. The former halts the training process if validation loss ceases to improve for a stipulated number of epochs, while the latter diminishes the learning rate if the validation loss reaches a plateau (Chollet et al. 2015).
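The building blocks of the network's forward pass (a dense layer, ReLU, and softmax) can be illustrated with a toy sketch. This is not the trained Keras model, just the activations defined above applied to made-up weights.

```python
import math
import random

def relu(x):
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in x]

def softmax(z):
    """Probability distribution over classes; subtract max for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, bias):
    """One fully connected layer: weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

# Toy forward pass: 3 input features -> 2 hidden units (ReLU) -> 2 classes
rng = random.Random(0)
x = [1.0, 0.5, -0.2]
w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
hidden = relu(dense(x, w1, [0.0, 0.0]))
probs = softmax(dense(hidden, [[0.3, -0.1], [0.2, 0.4]], [0.0, 0.0]))
```

Whatever the logits, the softmax output always sums to one, which is what lets the final layer be read as class probabilities over the presidents.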
b. Support Vector Machine (SVM)
The Support Vector Machine (SVM) is a supervised learning algorithm suitable for both classification and regression tasks. It operates by identifying the optimal hyperplane that segregates a dataset into distinct classes. Provided a set of training examples, each labeled as belonging to one of two categories, the SVM training algorithm constructs a model that categorizes new examples into one of the two categories (Cortes and Vapnik 1995).
Mathematically, given labeled training data \((x_1, y_1), \ldots, (x_N, y_N)\) where \(x_i\) belongs to \(\mathbb{R}^D\) and \(y_i\) is either 1 or -1 (indicating the class the input \(x_i\) belongs to), SVM seeks the hyperplane defined by \(w\) and \(b\) that optimally separates the data points of the two classes (Cortes and Vapnik 1995):
\[ y_i(w \cdot x_i + b) \geq 1 \]
The objective of SVM is to maximize the margin, which is the distance between the hyperplane and the nearest point from either class. The decision function is then given by:
\[ f(x) = \text{sign}(w \cdot x + b) \]
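A minimal sketch of this decision function, assuming a hyperplane has already been learned (the weights here are invented; in practice the study would rely on sklearn's `SVC`):

```python
def svm_decision(w, b, x):
    """sign(w . x + b): which side of the learned hyperplane x falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Toy hyperplane x1 + x2 = 1: points above it are class +1, below are -1.
w, b = [1.0, 1.0], -1.0
```

For the multi-class president problem, sklearn's `SVC` combines several such binary separators internally (one-vs-one), so the binary decision function above is the essential ingredient.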
c. Naive Bayes Classifier
Naive Bayes is a probabilistic classifier predicated on Bayes’ theorem with strong (naive) independence assumptions among features (Raschka 2014). Given a set of features \(X = x_1, \ldots, x_n\) and a class variable \(C\), Bayes’ theorem states:
\[ P(C|X) = \frac{P(X|C) \times P(C)}{P(X)} \]
The Naive Bayes classifier posits that the effect of a particular feature in a class is independent of other features. This simplification expedites computation, hence the term ‘naive’ (Raschka 2014).
In our problem, the Naive Bayes classifier estimates the probability of a sentence belonging to each president’s class based on the features (word frequencies for BoW or TF-IDF values). The sentence is then classified to the class (president) with the highest probability.
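In miniature, a multinomial Naive Bayes of this kind can be sketched as follows. This is a hand-rolled illustration with Laplace (alpha) smoothing; the study itself would have used sklearn's `MultinomialNB`, and the example sentences below are invented for demonstration.

```python
import math
from collections import Counter, defaultdict

def train_nb(sentences, labels):
    """Collect class priors and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for sentence, label in zip(sentences, labels):
        tokens = sentence.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, len(labels)

def predict_nb(model, sentence, alpha=1.0):
    """argmax over classes of log P(C) + sum_w log P(w|C), smoothed."""
    class_counts, word_counts, vocab, n = model
    best, best_lp = None, -math.inf
    for c in class_counts:
        lp = math.log(class_counts[c] / n)              # log prior P(C)
        total = sum(word_counts[c].values())
        for w in sentence.lower().split():
            lp += math.log((word_counts[c][w] + alpha) /
                           (total + alpha * len(vocab)))  # smoothed P(w|C)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Invented training sentences, purely for illustration
model = train_nb(
    ["radical economic transformation", "economic growth and jobs",
     "national unity and reconciliation", "unity of our people"],
    ["Zuma", "Zuma", "Mandela", "Mandela"])
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing term keeps unseen words from zeroing out a class entirely.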
3. Model Evaluation
Evaluating the performance of machine learning models is paramount as it unveils the efficacy of the model and areas of potential improvement. Our evaluation paradigm leverages standard metrics including accuracy, precision, recall, and F1 score to quantify various facets of the model’s predictions in a multi-class classification setting such as ours, where predictions could be true or false for multiple classes (presidents, in this case).
a. Accuracy
Accuracy furnishes a broad overview of the model’s performance and is calculated as the ratio of correct predictions to the total predictions:
\[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]
Nonetheless, in imbalanced datasets, accuracy could be misleading.
b. Precision
Precision scrutinizes the model’s positive predictions. Specifically, it computes the frequency at which the model correctly predicted a specific president out of all predictions for that president:
\[ \text{Precision (for a given president)} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]
Where:
- True Positives (TP): The number of sentences correctly identified as belonging to that president.
- False Positives (FP): The number of sentences erroneously identified as belonging to that president, while they belong to a different one.
Precision is particularly crucial in scenarios where the cost of a false positive is high.
c. Recall (or Sensitivity)
Recall evaluates how effectively the model identifies sentences from a specific president. It calculates the proportion of actual sentences from a president that the model correctly identified:
\[ \text{Recall (for a given president)} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]
Where:
- False Negatives (FN): The number of sentences that genuinely belong to a president but were misclassified as belonging to another.
Recall is vital in contexts where missing a true instance is significant.
d. F1 Score
The F1 score is the harmonic mean of precision and recall, providing a balance between them. It achieves its best value at 1 (perfect precision and recall) and its worst at 0:
\[ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
The F1 score is particularly useful when there is an uneven data distribution among classes.
These metrics were computed for each president in our dataset and then averaged (weighted by the number of true instances for each president) to derive a single value representing the overall model’s performance. This approach ensures that the model’s aptitude to predict less frequent classes (presidents with fewer sentences) is considered, rendering the evaluation more robust and representative of the model’s true capabilities in a multi-class setting.
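The weighted averaging described above can be sketched as follows. It is equivalent in spirit to sklearn's `average='weighted'` option; the toy labels are invented.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Per-class precision/recall/F1, averaged with weights equal to
    each class's share of the true instances (its support)."""
    support = Counter(y_true)
    n = len(y_true)
    agg = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        w = support[c] / n
        agg["precision"] += w * prec
        agg["recall"] += w * rec
        agg["f1"] += w * f1
    return agg

scores = weighted_scores(
    ["Zuma", "Zuma", "Mbeki", "Mbeki", "Mandela"],
    ["Zuma", "Mbeki", "Mbeki", "Mbeki", "Mandela"])
```

Because the weights are the class supports, a rarely speaking president (de Klerk, Motlanthe) contributes proportionally less to the average, while still being scored per class.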
Moreover, the models were also assessed on separate training and test datasets. The training dataset is the learning corpus for the model, while the test dataset presents a fresh, unseen set of data points to gauge the model’s generalization to new data. This separation is pivotal to ensure that the model doesn’t merely memorize the training data (overfitting), but discerns the underlying patterns determining which president uttered a given sentence.
Results
Bag of Words
In the exploration of various models on a training set employing a bag of words representation, distinct performance disparities were observed. Firstly, utilising a feed-forward neural network, an impressive training accuracy of 0.994 was attained. However, its validation accuracy registered at 0.601, hinting at potential overfitting. An analysis of the test set predictions, as depicted by the confusion matrix, demonstrated that the correct classes were predominantly predicted for the corresponding sentences. Notably, the test set metrics were: precision at 0.595, recall at 0.575, and an F1 score of 0.569.
The employment of support vector machines (SVM) with the bag of words approach, post-tuning, yielded training and validation accuracies of 0.989 and 0.533 respectively, with the test accuracy being 0.55. The precision, recall, and f1 scores for this model stood at 0.551, 0.55, and 0.547 respectively.
Lastly, when Naive Bayes was paired with the bag of words method, post-optimisation, it achieved training and validation accuracies of 0.89 and 0.592 respectively. This model appeared less prone to overfitting compared to its counterparts. For the test set, precision was measured at 0.624, recall at 0.615, and the F1 score at 0.616, slightly higher than the recall.
Neural network
| Training Accuracy | Validation Accuracy |
|---|---|
| 0.994 | 0.601 |
| Metric | Value |
|---|---|
| Precision | 0.595 |
| Recall | 0.575 |
| F1 Score | 0.569 |
Support Vector Machine
| Hyperparameter Value | Training Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| 10.000 | 0.989 | 0.533 | 0.550 |
| C | Training Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| 0.001 | 0.295 | 0.294 | 0.294 |
| 0.010 | 0.304 | 0.309 | 0.306 |
| 0.100 | 0.391 | 0.322 | 0.325 |
| 0.500 | 0.748 | 0.425 | 0.445 |
| 1.000 | 0.920 | 0.527 | 0.566 |
| 10.000 | 0.989 | 0.533 | 0.550 |
| 100.000 | 0.995 | 0.498 | 0.517 |
A table of hyperparameter tuning results for the SVM model trained on the Bag of Words representation, with the best hyperparameters highlighted above.
| Metric | Value |
|---|---|
| Precision | 0.551 |
| Recall | 0.55 |
| F1 Score | 0.547 |
Naive Bayes
| Hyperparameter Value | Training Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| 0.100 | 0.890 | 0.592 | 0.615 |
| alpha | Training Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| 0.001 | 0.906 | 0.572 | 0.597 |
| 0.010 | 0.904 | 0.581 | 0.608 |
| 0.100 | 0.890 | 0.592 | 0.615 |
| 1.000 | 0.838 | 0.586 | 0.611 |
| 10.000 | 0.637 | 0.494 | 0.523 |
A table of hyperparameter tuning results for the NB model trained on the Bag of Words representation, with the best hyperparameters highlighted above.
| Metric | Value |
|---|---|
| Precision | 0.624 |
| Recall | 0.615 |
| F1 Score | 0.616 |
TF-IDF
In a subsequent analysis utilising the term frequency-inverse document frequency (tf-idf) representation with various models, certain resemblances to the bag of words (bow) results were discerned. Firstly, with the feed-forward neural network, the accuracy plot bore a striking similarity to its bow counterpart. This model achieved a training accuracy of 0.99 and a validation accuracy of 0.588. The confusion matrix for test set predictions indicated that the majority of sentences were assigned their correct classes. The test set metrics recorded were: precision at 0.598, recall at 0.597, and the f1 score at 0.595.
When the support vector machines (SVM) were employed in tandem with the tf-idf representation, the accuracy plot was found to mirror that of the bow version. After tuning, the training accuracy registered at 0.968, with validation and test accuracies being 0.542 and 0.574, respectively. The precision, recall, and f1 scores for this model were 0.58, 0.574, and 0.573 in that order.
Lastly, the Naive Bayes model with the tf-idf approach displayed accuracy plots bearing a resemblance to the bow version. Post-optimisation, it yielded training, validation, and test accuracies of 0.915, 0.594, and 0.611 respectively. The precision and recall both stood at 0.611, whilst the F1 score was slightly lower at 0.609.
Neural network
| Training Accuracy | Validation Accuracy |
|---|---|
| 0.994 | 0.588 |
| Metric | Value |
|---|---|
| Precision | 0.598 |
| Recall | 0.597 |
| F1 Score | 0.595 |
Support Vector Machine
| Hyperparameter Value | Training Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| 1.000 | 0.968 | 0.542 | 0.574 |
| C | Training Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| 0.001 | 0.295 | 0.294 | 0.294 |
| 0.010 | 0.300 | 0.296 | 0.288 |
| 0.100 | 0.364 | 0.306 | 0.301 |
| 0.500 | 0.832 | 0.435 | 0.439 |
| 1.000 | 0.968 | 0.542 | 0.574 |
| 10.000 | 0.995 | 0.531 | 0.550 |
| 100.000 | 0.996 | 0.528 | 0.544 |
A table of hyperparameter tuning results for the SVM model trained on the TF-IDF representation, with the best hyperparameters highlighted above.
| Metric | Value |
|---|---|
| Precision | 0.58 |
| Recall | 0.574 |
| F1 Score | 0.573 |
Naive Bayes
| Hyperparameter Value | Training Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| 0.100 | 0.915 | 0.594 | 0.611 |
| alpha | Training Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| 0.001 | 0.931 | 0.575 | 0.594 |
| 0.010 | 0.929 | 0.584 | 0.605 |
| 0.100 | 0.915 | 0.594 | 0.611 |
| 1.000 | 0.812 | 0.575 | 0.590 |
| 10.000 | 0.622 | 0.500 | 0.519 |
A table of hyperparameter tuning results for the NB model trained on the TF-IDF representation, with the best hyperparameters highlighted above.
| Metric | Value |
|---|---|
| Precision | 0.611 |
| Recall | 0.611 |
| F1 Score | 0.609 |
Token Embeddings
Upon utilising text embedding as a representation technique alongside various models, a marked degradation in performance was observed compared to other preprocessing methods. With the feed-forward neural network, the training accuracy was a mere 0.409, while the validation accuracy dropped further to 0.369. The confusion matrix for test set predictions was quite telling: for a majority of sentences, the correct classes were not discerned. Intriguingly, the class “Zuma” was predominantly predicted. The test set showcased a precision of 0.367, recall of 0.368, and a notably lower f1 score of 0.328.
When paired with the support vector machines (SVM), post-tuning, the training accuracy stood at 0.406, with validation and test accuracies of 0.361 and 0.347, respectively. The precision was 0.342, the recall was 0.347, and the f1 score was slightly lower at 0.319.
Incorporating the Naive Bayes model with text embedding, post-optimisation, the training, validation, and test accuracies were 0.359, 0.359, and 0.338 in that order. This model’s precision and recall registered at 0.335 and 0.338 respectively, with the F1 score being considerably lower at 0.291. This underlines the challenge posed by text embeddings in this specific context, as the results were notably inferior to other data preparation methods.
Neural network
| Training Accuracy | Validation Accuracy |
|---|---|
| 0.409 | 0.369 |
| Metric | Value |
|---|---|
| Precision | 0.367 |
| Recall | 0.368 |
| F1 Score | 0.328 |
Support Vector Machine
| Hyperparameter Value | Training Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| 0.500 | 0.406 | 0.361 | 0.347 |
| C | Training Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| 0.001 | 0.257 | 0.257 | 0.257 |
| 0.010 | 0.300 | 0.300 | 0.292 |
| 0.100 | 0.368 | 0.340 | 0.335 |
| 0.500 | 0.406 | 0.361 | 0.347 |
| 1.000 | 0.451 | 0.353 | 0.341 |
| 10.000 | 0.574 | 0.324 | 0.338 |
| 100.000 | 0.687 | 0.334 | 0.327 |
A table of hyperparameter tuning results for the SVM model trained on the text embedding representation, with the best hyperparameters highlighted above.
| Metric | Value |
|---|---|
| Precision | 0.342 |
| Recall | 0.347 |
| F1 Score | 0.319 |
Naive Bayes
| Hyperparameter Value | Training Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| 0.001 | 0.359 | 0.359 | 0.338 |
| alpha | Training Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| 0.001 | 0.359 | 0.359 | 0.338 |
| 0.010 | 0.359 | 0.359 | 0.338 |
| 0.100 | 0.359 | 0.359 | 0.338 |
| 1.000 | 0.359 | 0.359 | 0.338 |
| 10.000 | 0.359 | 0.359 | 0.338 |
A table of hyperparameter tuning results for the NB model trained on the text embedding representation, with the best hyperparameters highlighted above.
| Metric | Value |
|---|---|
| Precision | 0.335 |
| Recall | 0.338 |
| F1 Score | 0.291 |
BERT Embeddings with pre-trained classifier
The code for this section was adapted from the following source: Google Tensorflow BERT tutorial
Utilising the BERT embedding in tandem with a pre-trained model, a strategy known as transfer learning, distinctive patterns in performance were observed. Throughout the training epochs, the training accuracy showcased a consistent uptick. However, the validation accuracy plateaued rather swiftly, exhibiting minimal fluctuations thereafter. At the culmination of the training, the accuracy metrics stood as follows: training accuracy at 0.759, validation accuracy at 0.684, and a slightly higher test accuracy of 0.712. Further delving into the test set metrics, the precision was 0.71, recall was 0.707, and the f1 score was close behind at 0.708. An examination of the confusion matrix for the test set underscored these findings. The model predominantly made accurate predictions for the respective presidents, mirroring the positive metrics mentioned earlier. This highlights the efficacy of the BERT embeddings and transfer learning in this particular context, as the results were substantially more favourable than some other methods previously explored.
| | accuracy | precision | recall | f1_score |
|---|---|---|---|---|
| Training | 0.759 | 0.507 | 0.837 | 0.503 |
| Validation | 0.684 | 0.548 | 0.744 | 0.543 |
| Test | 0.712 | 0.710 | 0.707 | 0.708 |
Discussion
This study aimed to figure out which of the South African presidents, from 1994 to 2022, might have said certain sentences during their State of the Nation Address (SONA). Different ways of processing the text, such as Bag of Words (BoW), Term Frequency-Inverse Document Frequency (tf-idf), and text embeddings, were used. These methods were then paired with machine learning models to see which combination worked best.
With the Bag of Words (BoW) method, the feed-forward neural network did well in training but not as well in validation, suggesting it might not do well with new, unseen data. The SVM and Naive Bayes models had similar outcomes. The tf-idf method gave results close to BoW for the neural net and SVM, but Naive Bayes seemed a bit more stable. However, simple text embeddings didn’t work as well across the board. This could be because these embeddings might be too basic to capture the unique way presidents speak in their SONA addresses.
On the other hand, using BERT embeddings with a pre-trained model gave us some hope. The model kept getting better during training, and its test results were the best among all the models. This suggests that using advanced methods like BERT might be the way forward for such tasks.
Conclusion
This study shows how important it is to pick the right method to process text and the right model to analyse it. While methods like BoW and tf-idf gave decent results, simple text embeddings didn’t do as well. But, the combination of BERT embeddings and a pre-trained model stood out.
This has two main takeaways. First, for researchers looking into political speeches, these models can help in figuring out who might have said an unattributed speech. Second, for those into machine learning, it highlights the growing role of advanced methods like BERT.
Looking ahead, it might be worth exploring even better text processing methods or fine-tuning models like BERT for even more accurate results. Overall, this study shows the exciting possibilities when combining tech with the study of political speeches.